REF: centralize pyarrow Table to pandas conversions and types_mapper handling #60324
Conversation
Nice change overall; just some questions
types_mapper: type[pd.ArrowDtype] | None | Callable
if dtype_backend == "numpy_nullable":
    mapping = _arrow_dtype_mapping()
    if null_to_int64:
I realize this is for compatibility, but is it a feature or a bug that the CSV reader does this?
No idea; I didn't look in detail at why this happens in the CSV reader, and for now I just wanted to keep the same behaviour. (This is certainly an ugly keyword, but that is the downside of centralizing the conversion as I am doing: it's no longer possible to change this behaviour on the CSV side alone.)
pandas/io/sql.py
df = cur.fetch_arrow_table().to_pandas(types_mapper=mapping)
pa_table = cur.fetch_arrow_table()
dtype_backend = (
    lib.no_default if dtype_backend == "numpy" else dtype_backend
Is there any harm in forwarding the "numpy" argument along? It seems a little strange to handle this one-off in some invocations.
It's also strange that we use "numpy" as the default in the SQL code and no_default everywhere else. Ideally we would solve that inconsistency as well, I think.
Now, I can certainly forward it and handle it inside arrow_table_to_pandas (I could also change the default here, and just additionally accept "numpy" for backwards compatibility).
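A sketch of what forwarding could look like, normalizing the SQL reader's legacy "numpy" value inside the shared helper; the function name and its placement here are assumptions for illustration, not the final pandas code:

```python
from pandas._libs import lib


def _normalize_dtype_backend(dtype_backend):
    # hypothetical helper: the SQL reader historically passed "numpy"
    # where the other readers pass lib.no_default, so accept both
    # for backwards compatibility
    if dtype_backend == "numpy":
        return lib.no_default
    return dtype_backend
```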
Hmm, that seems like an issue in the SQL code. It looks like the read_sql API uses lib.no_default, so that must be getting messed up in the internals. Let's keep this PR the way it is and track it in #60326.
Thanks for opening that issue.
In the end I moved it into the utility anyway, to deal with the typing issues (the type checkers don't understand that I removed "numpy" from the options ...).
Sounds good. Still something to be cleaned up there, though not urgent for now
lgtm except for the code-checks failure (didn't look at it)
Owee, I'm MrMeeseeks, Look at me. There seems to be a conflict; please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the backport label once the PR gets merged. If these instructions are inaccurate, feel free to suggest an improvement.
Will backport
Hmm, looking at this some more @jorisvandenbossche, it looks like the 2.3 branch in parquet.py requires there to be a ...
Yeah, either just add that, or add a generic |
… conversions and types_mapper handling (cherry picked from commit 12d6f60)
Backport PR #60332
…ns and types_mapper handling (#60332) (cherry picked from commit 12d6f60) Co-authored-by: Joris Van den Bossche <[email protected]>
We had defined this logic in several places, so this defines one helper function that the different IO formats can reuse.